Data Integrity Suite

The suite is composed of various checks, such as Identifier Label Correlation, Is Single Value, and Feature Feature Correlation.
Each check may contain conditions, which result in a pass, fail, warning, or error status, as well as other outputs such as plots or tables.
Suites, checks, and conditions can all be modified.
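As a rough illustration of how a check, its condition, and the resulting status relate, the sketch below models a single-value check in plain Python. The names `Status`, `count_unique`, `not_single_value`, and `run_check` are hypothetical and are not the suite's actual API; only the four outcomes (pass, fail, warning, error) come from the description above.

```python
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    WARNING = "warning"  # a condition could be configured to warn instead of fail
    ERROR = "error"


def count_unique(columns: Dict[str, List]) -> Dict[str, int]:
    """Check logic: number of distinct values per column."""
    if not columns:
        raise ValueError("dataset has no columns")
    return {name: len(set(values)) for name, values in columns.items()}


def not_single_value(result: Dict[str, int]) -> Status:
    """Condition: no column contains only a single value."""
    return Status.FAIL if any(n == 1 for n in result.values()) else Status.PASS


def run_check(check: Callable, condition: Callable, data) -> Status:
    """Run the check, then its condition; a check that raises yields ERROR."""
    try:
        result = check(data)
    except Exception:
        return Status.ERROR
    return condition(result)


sample = {"petal.width": [0.3, 1.7, 2.5], "variety": ["Setosa", "Virginica", "Virginica"]}
print(run_check(count_unique, not_single_value, sample))  # Status.PASS
```

In this sketch a check that cannot run at all is reported as an error rather than a failure, which mirrors how the report lists checks that raised an exception separately from checks whose condition was not met.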


Conditions Summary

Status | Check | Condition | More Info
Not passed | Feature Label Correlation | Features' Predictive Power Score is less than 0.8 | Found 2 out of 4 features with PPS above threshold: {'petal.width': '0.93', 'petal.length': '0.86'}
Not passed | Feature-Feature Correlation | Not more than 0 pairs are correlated above 0.9 | Correlation is greater than 0.9 for pairs [('petal.length', 'petal.width')]
Passed | Data Duplicates | Duplicate data ratio is less or equal to 5% | Found 0.67% duplicate data
Passed | Single Value in Column | Does not contain only a single value | Passed for 5 relevant columns
Passed | Special Characters | Ratio of samples containing solely special character is less or equal to 0.1% | Passed for 5 relevant columns
Passed | Mixed Nulls | Number of different null types is less or equal to 1 | Passed for 5 relevant columns
Passed | Mixed Data Types | Rare data types in column are either more than 10% or less than 1% of the data | 5 columns passed: found 0 columns with negligible types mix, and 5 columns without any types mix
Passed | String Mismatch | No string variants | Passed for 1 relevant column
Passed | String Length Out Of Bounds | Ratio of string length outliers is less or equal to 0% | Passed for 1 relevant column
Passed | Conflicting Labels | Ambiguous sample ratio is less or equal to 0% | Ratio of samples with conflicting labels: 0%

Check With Conditions Output

Feature Label Correlation

Returns the PPS (Predictive Power Score) of all features in relation to the label.

Conditions Summary
Status | Condition | More Info
Not passed | Features' Predictive Power Score is less than 0.8 | Found 2 out of 4 features with PPS above threshold: {'petal.width': '0.93', 'petal.length': '0.86'}
Additional Outputs
The Predictive Power Score (PPS) estimates the ability of a feature to predict the label by itself. A high PPS (close to 1) can indicate data leakage, meaning that the feature holds information that is derived from the label to begin with.


Feature-Feature Correlation

Checks for pairwise correlation between the features.

Conditions Summary
Status | Condition | More Info
Not passed | Not more than 0 pairs are correlated above 0.9 | Correlation is greater than 0.9 for pairs [('petal.length', 'petal.width')]
Additional Outputs
* Displayed as absolute values.
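
A minimal sketch of the pairwise check, assuming Pearson correlation on numeric columns (the report does not state which correlation measure the check uses, and the helper names are hypothetical). It takes absolute values and lists the pairs whose correlation exceeds the 0.9 threshold. The sample values are the five rows shown in the Outlier Sample Detection table of this report, so the result is illustrative only and not a recomputation over the full dataset.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat constant columns as uncorrelated
    return cov / (sx * sy)


def correlated_pairs(columns: Dict[str, List[float]],
                     threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Return column pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        if abs(pearson(columns[a], columns[b])) > threshold:
            pairs.append((a, b))
    return pairs


sample = {
    "sepal.length": [4.5, 4.9, 7.2, 4.6, 5.2],
    "sepal.width": [2.3, 2.5, 3.6, 3.4, 2.7],
    "petal.length": [1.3, 4.5, 6.1, 1.4, 3.9],
    "petal.width": [0.3, 1.7, 2.5, 0.3, 1.4],
}
print(correlated_pairs(sample))  # [('petal.length', 'petal.width')]
```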


Data Duplicates

Checks for duplicate samples in the dataset.

Conditions Summary
Status | Condition | More Info
Passed | Duplicate data ratio is less or equal to 5% | Found 0.67% duplicate data
Additional Outputs
0.67% of data samples are duplicates.
Each row in the table shows an example of duplicate data and the number of times it appears.
Instances | Number of Duplicates | sepal.length | sepal.width | petal.length | petal.width | variety
142, 101 | 2 | 5.80 | 2.70 | 5.10 | 1.90 | Virginica
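
The reported ratio can be reproduced with a short sketch. It assumes the ratio is defined as (rows - unique rows) / rows and that the dataset has 150 rows, as in the standard Iris dataset; neither is stated in the report. The function name `duplicate_summary` and the placeholder rows are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def duplicate_summary(rows: List[tuple]) -> Tuple[float, Dict[tuple, int]]:
    """Return the duplicate ratio and each duplicated row with its count."""
    counts = Counter(tuple(r) for r in rows)
    n_rows = len(rows)
    # Every copy beyond the first occurrence counts as a duplicate.
    ratio = (n_rows - len(counts)) / n_rows if n_rows else 0.0
    duplicates = {row: c for row, c in counts.items() if c > 1}
    return ratio, duplicates


# The duplicated sample from the table above, plus 148 distinct placeholder rows.
dup = (5.8, 2.7, 5.1, 1.9, "Virginica")
rows = [(float(i), 0.0, 0.0, 0.0, "placeholder") for i in range(148)] + [dup, dup]

ratio, duplicates = duplicate_summary(rows)
print(f"{ratio:.2%}")  # 0.67%
print(duplicates)      # {(5.8, 2.7, 5.1, 1.9, 'Virginica'): 2}
```

Under these assumptions, one extra copy among 150 rows gives 1 / 150, which rounds to the 0.67% shown in the report.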


Check Without Conditions Output

Outlier Sample Detection

Detects outliers in a dataset using the LoOP algorithm.

Additional Outputs
The Outlier Probability Score is calculated by the LoOP algorithm, which measures the local deviation of density of a given sample with respect to its neighbors. These outlier scores are directly interpretable as the probability that a sample is an outlier.

Index | Outlier Probability Score | sepal.length | sepal.width | petal.length | petal.width | variety
41 | 0.91 | 4.50 | 2.30 | 1.30 | 0.30 | Setosa
106 | 0.69 | 4.90 | 2.50 | 4.50 | 1.70 | Virginica
109 | 0.67 | 7.20 | 3.60 | 6.10 | 2.50 | Virginica
6 | 0.63 | 4.60 | 3.40 | 1.40 | 0.30 | Setosa
59 | 0.61 | 5.20 | 2.70 | 3.90 | 1.40 | Versicolor
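
For readers unfamiliar with LoOP (Local Outlier Probabilities), the following is a simplified pure-Python sketch of the published formulation of the algorithm. It is not the implementation used by the check: the neighborhood size `k`, the scaling factor `lam`, the Euclidean distance, and the function name are assumptions chosen for illustration, and the toy points are made up.

```python
from math import dist, erf, sqrt
from typing import List, Sequence


def loop_scores(points: Sequence[Sequence[float]], k: int = 3,
                lam: float = 3.0) -> List[float]:
    """Return one outlier probability in [0, 1] per point."""
    n = len(points)
    if n <= k:
        raise ValueError("need more than k points")

    # k nearest neighbors of each point, excluding the point itself.
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        neighbors.append(order[:k])

    # Probabilistic set distance: lam times the root mean squared
    # distance from a point to its neighbors.
    pdist = [
        lam * sqrt(sum(dist(points[i], points[j]) ** 2 for j in neighbors[i]) / k)
        for i in range(n)
    ]

    # PLOF: own pdist relative to the mean pdist of the neighbors, minus 1.
    plof = []
    for i in range(n):
        mean_nb = sum(pdist[j] for j in neighbors[i]) / k
        plof.append(pdist[i] / mean_nb - 1.0 if mean_nb > 0 else 0.0)

    # Normalize by the aggregate PLOF and map to a probability with erf.
    nplof = lam * sqrt(sum(v * v for v in plof) / n)
    if nplof == 0:
        return [0.0] * n
    return [max(0.0, erf(v / (nplof * sqrt(2)))) for v in plof]


# A 3x3 grid of inliers plus one distant point.
toy = [(float(x), float(y)) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
scores = loop_scores(toy)
print(max(range(len(toy)), key=scores.__getitem__))  # 9, the distant point
```

A point whose neighborhood is much sparser than the neighborhoods of its neighbors receives a large PLOF and therefore a score close to 1, while points in regions of uniform density score near 0.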


Other Checks That Weren't Displayed

Check | Reason
Identifier Label Correlation - Train Dataset | DatasetValidationError: Dataset does not contain an index or a datetime. see Dataset docs
Single Value in Column | Nothing found
Mixed Nulls | Nothing found
Special Characters | Nothing found
Mixed Data Types | Nothing found
String Mismatch | Nothing found
String Length Out Of Bounds | Nothing found
Conflicting Labels | Nothing found
